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ABSTRACT  OF  THE  THESIS 


An  Application  of  Adaptive  Learning 
to  Malfunction  Recovery 
by 

Robert  Edward  Cruz 
Master  of  Science  in  Computer  Science 
University  of  California,  Los  Angeles,  1985 
Professor  Jacques  J.  Vidal,  Chair 

A  self-organizing  controller  is  developed  for  a  simplified  two-dimensional 
aircraft  model.  The  controller  learns  how  to  pilot  the  aircraft  through  a  navi¬ 
gational  mission  without  exceeding  pre-established  position  and  velocity  lim¬ 
its.  The  controller  pilots  the  aircraft  by  activating  one  of  eight  directional  ac¬ 
tuators  at  all  times.  By  continually  monitoring  the  aircraft’s  position  and 
velocity  with  respect  to  the  mission,  the  controller  progressively  modifies  its 
decision  rules  to  improve  the  aircraft’s  performance.  When  the  controller  has 
learned  how  to  pilot  the  aircraft,  two  actuators  fail  permanently.  Despite  this 
malfunction,  the  controller  regains  proficiency  at  its  original  task.  The  exper¬ 
imental  results  reported  show  the  controller’s  capabilities  for  self-organizing 
control,  learning,  and  malfunction  recovery. 
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Introduction 


Objectives 

As  aircraft  technology  advances  in  complexity,  piloting  an  aircraft  is 
becoming  more  difficult  and  subject  to  error.  This  difficulty  can  be  critical 
during  an  in-flight  malfunction,  risking  the  loss  of  both  the  pilot  and  aircraft. 
In  these  situations,  it  is  important  to  devise  automated  assistance  for  the  pi¬ 
lot.  With  this  goal  in  mind,  UCLA,  and  the  NASA  Dryden  Flight  Research 
Facility  are  developing  expert  systems  for  potential  onboard  use  in  future  air¬ 
craft.  The  research  presented  in  this  thesis,  while  a  long  way  from  satisfying 
the  goal,  represents  an  initial  step  towards  its  achievement. 

The  immediate  objective  of  this  study  is  to  develop  a  controller  that 
learns  an  aircraft  task  and  recovers  when  the  aircraft  malfunctions.  A  com¬ 
puter  piogram  is  used  to  simulate  both  the  controller  and  the  aircraft.  Given 
limited  a  priori  information  and  a  trial-and-error  learning  strategy,  the  con¬ 
troller  learns  to  navigate  a  two-dimensional  aircraft  through  a  pre-established 
mission.  The  controller  uses  performance  feedback  that  is  taken  during  and 
after  each  aircraft  flight.  Because  its  learning  strategy  is  independent  of 
flight  dynamics,  the  model  can  be  applied  to  both  normal  and  abnormal  flight 
situations. 

In  essence,  the  controller  decomposes  the  problem  into  mutually  isolated 
subproblems  corresponding  to  different  regions  of  the  aircraft’s  allowable  state 
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space.  For  each  subproblem,  the  controller  implements  the  same  problem¬ 
solving  algorithm.  The  resulting  solutions  to  each  subproblem  contribute  to 
the  accomplishment  of  the  overall  flight  task.  In  this  manner,  the  controller 
produces  useful  results  for  a  problem  involving  a  relatively  large  search  space. 
Additionally,  the  decomposition  technique  lends  itself  to  faster  computation 
possibilities  related  to  parallel  processing  implementations. 


Previous  Work 

The  research  leading  to  the  present  work  centers  on  controllers  designed 
for  the  cart-pole  system  shown  in  Figure  1. 


Figure  1.  Cart-Pole  System 

The  system  consists  of  a  rigid  pole  mounted  to  the  top  of  a  motorized  cart. 
The  cart  moves  in  two  directions,  left  and  right,  along  a  st  ight  track  of 
fixed  length.  The  pole  is  hinged  to  the  cart  so  that  it  rotates  only  in  the  vert¬ 
ical  plane  bounding  the  cart’s  motion.  The  controller  moves  the  cart  by  ap- 


plying  a  constant-force  motor  either  to  the  left  or  to  the  right.  The  cart-pole 
system  is  inherently  unstable.  Therefore,  the  controller’s  task  is  to  keep  the 
pole  from  falling  by  continually  moving  the  cart  left  and  right  as  appropriate. 

The  cart-pole  system  was  initially  devised  by  Donaldson  [4]  in  1960.  In 
his  work,  Donaldson  designs  an  automaton  that  learns  the  cart-pole  balancing 
task  by  comparing  its  control  movements  to  those  of  a  human.  This  learning 
strategy,  using  the  terminology  of  Carbonell,  et.  al.  [3j  is  called  "learning  by 
example."  The  human  assumes  the  role  of  a  teacher  who  provides  examples 
for  the  automaton  to  imitate. 

In  1964,  Widrow  and  Smith  [13]  designed  a  controller  that  could  be 
trained  to  effectively  balance  the  pole.  It  consists  of  an  encoder  and  an  adap¬ 
tive  linear  element,  or  Adaline.  The  encoder  generates  patterns  based  on  the 
values  of  four  variables  that  describe  the  cart-pole  system  state: 

x  :  the  position  of  the  cart  on  the  track, 

9  :  the  angle  of  the  pole  with  the  vertical, 
x  :  the  velocity  of  the  cart,  and 
9  :  the  angular  velocity  of  the  pole. 

The  encoding  scheme  partitions  each  variable  into  discrete  intervals.  Conse¬ 
quently,  each  pattern  represents  a  different  combination  of  intervals  occupied 
by  the  state  values. 

The  Adaline  produces  a  weighted  sum  from  the  encoded  patterns.  If  the 
sum  is  greater  than  or  equal  to  a  certain  threshold  value,  the  controller  ap¬ 
plies  the  cart’s  motor  to  the  right;  otherwise,  it  applies  the  motor  to  the  left. 

The  controller  learns  to  balance  the  pole  by  adjusting  the  Adaline’s 
weights  according  to  an  observer’s  periodic  assessment  of  the  controller’s  per- 


formance.  When  performance  improves,  it  changes  the  weights  to  reinforce 
the  Adaline’s  decision  logic.  Conversely,  when  performance  degrades,  it  ad¬ 
justs  the  weights  so  that  the  decision  logic  is  reversed.  When  the  observer 
cannot  distinguish  a  change  in  performance,  the  weights  are  left  unchanged. 
Widrow  and  Smith  refer  to  this  learning  technique  as  "selective  bootstrap¬ 
ping."  Though  it  does  not  learn  by  examples,  it  still  requires  a  human  ob¬ 
server  to  assess  its  performance. 

In  1968,  Michie  and  Chambers  [6],  (7)  presented  an  autonomous  controller 
for  the  cart-pole  problem.  Its  learning  strategy,  using  Carbonell,  et.  al.  [3] 
terminology  again,  is  one  of  "learning  from  observation  and  discovery."  Both 
the  controller  and  the  cart-pole  system  are  simulated  by  a  computer  program 
called  Boxes.  The  name  derives  from  the  method  used  to  partition  the  cart- 
pole  state  space.  In  Michie  and  Chambers’  representation,  the  state  variables 
are  plotted  along  four  mutually  orthogonal  axes.  Consequently,  each  system 
state  corresponds  to  a  unique  point  in  the  4-dimensional  state  space.  By  us¬ 
ing  Widrow  and  Smith’s  scheme  of  partitioning  the  state  variables  into  inter¬ 
vals,  the  state  space  divides  into  discrete  regions,  or  boxes. 

A  "demon"  resides  in  each  box.  Each  demon  decides  the  controller’s  out¬ 
put  when  the  cart-pole  state  enters  its  box.  By  tabulating  the  consequences 
of  their  decisions,  the  demons  learn  the  best  controller  output  for  each  cart- 
pole  state.  Hence,  the  controller  automatically  assesses  its  performance  and 
adjusts  its  decisions  so  that  it  eventually  learns  its  task. 

In  1982,  Barto,  Sutton,  and  Anderson  [2]  presented  a  similar  program  for 
the  cart-pole  problem.  Their  aim  was  to  show  how  the  cart-pole  controller 
could  be  built  with  neuron-like  adaptive  elements  that  they  had  developed. 
The  controller  consists  of  a  single  Associative  Search  Element  (ASE)  and  a 


single  Adaptive  Critic  Element  (ACE).  Both  elements  rely  on  the  state  space 
representation  used  by  Michie  and  Chambers.  The  ASE  utilizes  adaptive 
threshold  logic  to  control  the  cart’s  movement.  Its  thresholds  are  modified 
according  to  reinforcement  feedback  provided  by  the  ACE.  The  ACE  pro¬ 
duces  the  feedback  by  applying  threshold  logic  to  the  consequences  of  each 
controller  output.  Barto,  et.  al.  showed  that  their  controller  performs 
significantly  better  than  the  one  designed  by  Michie  and  Chambers. 

In  1966,  Schaefer  and  Cannon  (10)  showed  that  the  cart-pole  problem  gen¬ 
eralizes  to  an  infinite  sequence  of  problems  of  increasing  difficulty,  with  1,  2, 
3,  etc.,  poles  balanced  each  on  top  of  the  other.  The  controller  to  be 
described  represents  a  different  generalization  of  the  problem.  Whereas  mo¬ 
tion  in  the  cart-pole  system  is  one-dimensional,  it  provides  control  for  a  two- 
dimensional  system.  Consequently,  this  research  lays  the  groundwork  for  fu¬ 
ture  work  on  automatic  control  in  two-  and  three-dimensional  systems. 


Outline  of  the  Paper 

The  organization  of  this  paper  has  been  divided  into  three  major  sections. 
In  the  first,  the  Boxes  method  is  built  into  a  controller  for  a  two-dimensional 
aircraft  model.  The  controller  is  exercised  in  three  simulation  experiments. 
In  the  first  experiment,  the  controller  is  designed  with  a  learning  strategy 
similar  to  Michie  and  Chambers’.  Afterward,  the  controller  is  enhanced  with 
adaptive  elements  performing  functions  similar  to  the  ASE  and  ACE  designed 
by  Barto,  et  d  In  the  second  section,  two  more  experiments  are  conducted. 
T'  •  .  o  study  the  controller’s  ability  to  pilot  the  aircraft  after  the 

an  *s  The  last  section  is  devoted  to  a  discussion  of  the 
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Experiments  in  Adaptive  Control 


In  the  following  experiments,  a  controller  is  developed  for  a  simplified 
two-dimensional  aircraft  model.  The  aircraft’s  environment  consists  of  pre- 
established  boundaries  on  its  flight  position  and  velocity  with  respect  to  a 
two-dimensional  Cartesian  coordinate  system  (Figure  2). 


Figure  2.  2-Dimensional  Aircraft  in  a  Position- Velocity  World 


The  aircraft  is  equipped  with  force  actuators  that  provide  constant  ac¬ 
celeration  in  eight  directions  with  respect  to  the  center  of  its  vertical  plane  of 
motion:  up,  down,  left,  right,  up-left,  up-right,  down-left,  and  down-right. 
These  actuators  correspond  to  the  bi-directional  motor  used  in  the  cart-pole 
problem.  Therefore,  the  controller  has  been  designed  to  activate  only  one  ac¬ 
tuator  at  a  time.  The  aircraft  enters  a  failure  state  when  it  flies  outside  of  its 
position  boundaries  or  exceeds  maximum  speed  limits.  These  restrictions 


correspond  to  the  cart  running  off  an  end  of  its  track  or  the  pole  falling.  A 
flight  succeeds  when  the  controller  maintains  flight  within  position  and  veloci¬ 
ty  limits  for  a  predetermined  amount  of  time. 

The  controller’s  design  has  been  adapted  from  the  Boxes  system  developed 
by  Michie  and  Chambers  [6],  (7j.  The  exact  details  will  be  described  in  the 
following  sections. 

Discretization  of  Aircraft  States 

At  any  point  in  time,  the  current  aircraft  state  is  defined  by  four  vari¬ 
ables: 

x  :  the  aircraft’s  position  on  the  X  axis, 
y  :  the  aircraft’s  position  on  the  Y  axis, 
x  :  the  aircraft’s  velocity  along  the  X  axis,  and 
y  :  the  aircraft’s  velocity  along  the  Y  axis. 

These  variables  correspond  to  the  variables  x,  0,  x,  and  0  which  defined  the 
cart-pole  system  state.  The  variables  are  plotted  along  four  mutually  orthog¬ 
onal  axes.  This  orientation  defines  a  four-dimensional  state  space.  Each  air¬ 
craft  state  is  represented  by  a  point  in  this  space.  To  differentiate  between 
aircraft  states,  the  four  state  variables  are  divided  into  value  ranges  creating 
discrete  thresholds  for  the  state  values.  (Figure  3). 

The  proper  threshold  values  are  dependent  on  performance  characteristics 
of  the  model  aircraft  and  its  mission.  For  the  next  three  experiments,  the 
thresholds  shown  in  Figure  3  have  been  selected.  The  variables  x  and  y  are 
partitioned  into  five  allowable  ranges  by  thresholds  at  10,  30,  and  50  meters 
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Figure  3.  Range-Coded  Aircraft  State  Variables 


in  both  the  plus  and  minus  directions  with  respect  to  the  coordinate  origin. 
Values  for  x  or  y  of  magnitude  greater  than  50  meters  signal  an  aircraft 
failure.  Similarly,  the  flight  velocity  variables  x  and  y  are  divided  into  three 
distinct  ranges  by  symmetric  thresholds  at  2  and  10  meters  per  second. 
Again,  values  for  x  and  y  of  magnitude  greater  than  10  m/s  constitute  an  air¬ 
craft  failure.  Thresholding  the  state  variables  thus  "lumps  together"  closely- 
related  aircraft  states  such  that  the  four-dimensional  state  space  within  which 
the  aircraft  operates  becomes  subdivided  into  5X&X3X3—225  distinct  re¬ 
gions,  or  "boxes."  Using  the  Boxes  framework,  the  controller’s  task  may  be  re¬ 
garded  as  maintaining  the  four  state  variables  within  their  limits  so  that  the 
current  aircraft  state  falls  within  one  of  the  225  boxes  at  all  times. 


Force  Actuator  Activation 

For  simulation  purposes,  the  aircraft’s  flight  has  been  time-sliced  into  1- 
second  intervals.  During  each  interval,  the  controller  activates  a  force  actua¬ 
tor.  For  the  experiments  that  follow,  each  actuator  has  been  "designed"  to 
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provide  1.5  newtons  of  constant  thrust  in  one  of  the  eight  directions. 
Depending  on  the  direction  in  which  it  is  applied,  the  activation  of  an  actua¬ 
tor  changes  the  aircraft's  current  position  and  velocity  and,  thus  determines  a 
new  aircraft  state.  In  this  fashion,  each  actuator  activation  serves  as  a  transi¬ 
tional  operator  that  moves  the  aircraft  from  one  box  to  another  within  its  al¬ 
lowable  state  space. 

Problem  Decomposition  Using  Demons 

To  solve  its  problem,  the  controller  must  learn  to  avoid  (sequences  of)  ac¬ 
tions  that  lead  to  an  aircraft  failure.  Obviously,  certain  actions  are  appropri¬ 
ate  in  some  instances  and  inappropriate  in  others.  Because  the  controller 
does  not  have  a  built-in  model  of  its  environment,  it  must  learn  by  trial  and 
error  the  proper  actuator(s)  to  activate  in  a  given  situation. 

Recall  that  partitioning  the  state  variables  has  created  a  four-dimensional 
state  space  with  225  regions,  or  boxes.  For  illustrative  purposes,  imagine  that 
these  boxes  are  inhabited  by  "local  demons"— one  per  box— all  of  which  are 
under  the  supervision  of  a  "global  demon"  (Figure  4).  The  global  demon  is  in 
charge  of  the  overall  flight  task.  The  local  demons  concern  themselves  only 
with  aircraft  flight  when  the  aircraft  state  enters  their  box.  Upon  entry  into 
a  box,  the  local  demon  must  decide  which  of  the  aircraft’s  eight  actuators  to 
activate  next.  After  making  its  decision,  the  local  demon  informs  the  global 
demon  who,  in  turn,  activates  the  apprpriate  actuator.  After  the  actuator 
has  been  activated  for  a  unit  time-step,  the  global  demon  determines  the  new 
box  within  which  the  aircraft  state  now  resides,  and  asks  the  corresponding 
local  demon  which  force  to  activate  next.  This  sequence  continues  until  the 


Figure  4.  Network  of  Demons  for  the  Aircraft  State  Space 

aircraft  enters  a  failure  state,  thus  ending  the  trial  run. 

The  use  of  global  and  local  demons  exemplifies  the  problem-solving  tech¬ 
nique  of  problem  decomposition  into  subproblems.  In  order  to  solve  the 
overall  problem,  the  global  demon  divides  it  equally  into  225  smaller  ones  and 
delegates  their  solutions  to  the  local  demons.  Because  each  demon  oversees  a 
separate  region  of  the  aircraft  state  space,  its  job  is  to  determine  which  force 
setting  best  avoids  aircraft  failure  when  the  current  state  falls  within  its  as¬ 
signed  region. 

In  order  to  carry  out  its  task,  each  local  demon  records  its  previous  ex¬ 
perience  of  the  aircraft’s  flight  by  tabulating  the  following  data: 


Force  Lifetimes:  The  total  lifetime  of  decisions  to  activate  a 
force  actuator  in  a  given  direction.  A  force  lifetime  is  the 
difference  between  the  time  of  aircraft  failure  and  the  time  when 
the  aircraft  state  enters  a  box  and  the  local  demon  decides  to 
apply  the  force.  A  force’s  total  Lifetime  is  a  weighted  sum  of  all 
of  its  "individual"  lives  during  previous  runs. 

Force  Usages:  Weighted  sums,  for  each  force  direction,  of  the 
total  number  of  times  the  local  demon  decided  to  activate  a 
force  during  previous  runs. 

Entry  Times:  The  times  during  which  the  aircraft  state  en¬ 
tered  the  demon’s  box  during  the  current  run.  Time  is  initial¬ 
ized  to  1  at  the  beginning  of  a  run,  and  continues  in  1-second 
increments  until  aircraft  failure. 


Experimental  Procedures  and  Results 

Three  experiments  of  1000  simulated  flights  are  conducted.  Before  the 
first  run  of  each  experiment,  local  force  Lifetimes  and  Usages  are  initialized  to 
zero.  Additionally,  control  decisions  for  the  local  demons  are  determined  at 
random.  Each  run  begins  at  a  randomly- generated  initial  point  within  the 
aircraft’s  allowable  state  space.  The  run  terminates  when  the  aircraft  enters 
a  failure  state  or  avoids  failure  for  1200  time-steps.  Thus,  1200  seconds,  or  20 
simulated  minutes,  is  established  as  the  duration  of  a  successful  flight. 

The  objective  of  the  first  experiment  is  to  demonstrate  that  Michie  and 
Chambers’  Boxes  method  can  be  effectively  utilized  by  a  controller  for  a  sim¬ 
ple  aircraft.  This  objective  assumes  a  close  correspon dance  between  the  cart- 
pole  problem  and  the  current  aircraft  task.  Hence,  the  procedures  used  in 
Experiment  1  are  similar  to  those  outlined  by  Michie  and  Chambers  [6j. 
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Experiment  1  Procedures 


In  this  experiment,  local  demons  are  allowed  to  decide  on  only  one  force 
actuator  to  activate  per  run.  Therefore,  regardless  of  how  many  times  the 
aircraft  state  enters  a  demon’s  box  during  a  run,  the  demon’s  control  decision 
remains  the  same.  Initial  states  for  each  run  consist  of  randomly-generated 
values  for  x  and  y  between  ±30m  and  values  for  x  and  y  between  ±2m/s. 
This  initialization  procedure  restricts  the  initial  aircraft  state  to  nine  local 
demon  boxes  located  in  the  center  portion  of  the  aircraft  state  space. 

When  the  aircraft  state  enters  a  demon  box  during  the  first  run,  the  fol¬ 
lowing  actions  occur: 

1.  The  local  demon  records  the  time  of  entry. 

2.  The  local  demon  signals  the  global  demon  to  activate  a  force  actuator. 
The  local  demon’s  decision  depends  on  tabulated  experience  of  the  conse¬ 
quences  of  its  previous  decisions.  However,  during  the  first  run,  this  decision 
is  generated  at  random. 

As  these  actions  continue,  the  aircraft  state  transitions  from  one  demon  box 
to  another  until  it  finally  reaches  a  failure  state.  This  event  terminates  a  trial 
run  and  triggers  the  following  actions: 

1.  The  global  demon  informs  the  local  demons  that  an  aircraft  failure  has 
occurred. 

2.  Each  local  demon  updates  its  eight  pairs  of  force  Lifetime  and  Usage 
totals.  Based  on  these  new  totals,  it  determines  which  force  actuator  to  ac¬ 
tivate  (via  the  global  demon)  for  the  duration  of  the  next  trial  run. 

If  a  force  actuator  was  active  before  the  aircraft  failed,  its  Lifetime  and  Usage 
values  are  calculated  as  follows: 

N 

Lifetime  =  Lifetime'  XDK  +  £]tf~tj 

i-l 
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where  N  =  the  number  of  times  that  the  aircraft  state  entered  the  demon 
box  during  the  run  that  just  failed,  and 

tf  and  tj  correspond,  respectively,  to  the  time  of  aircraft  failure,  and  the  indi¬ 
vidual  times  of  entry  into  the  demon  box. 

usage  =  Usage' XDK  +  N 

where  DK  =  0.99  is  a  constant  multiplier  less  than  unity  that  weights  recent 
experience  relative  to  earlier  experience. 

If  a  force  actuator  was  inactive  before  the  aircraft  failure,  its  Lifetime  and 
Usage  values  are  reduced,  respectively,  according  to  the  following  equations: 

Lifetime  =  Lifetime'  X  DK,  and 
Usage  =  Usage'  X  DK 

In  order  to  determine  which  actuator  to  activate  next,  the  local  demons 
refer  to  a  "target"  value  supplied  by  the  global  demon.  This  value  represents 
the  mean  lifetime  of  the  aircraft  for  all  previous  runs.  It  is  calculated  from 
the  global  demon’s  Lifetime  (GL)  and  Usage  (GU)  values  in  the  following 
manner: 

GL  =  GL'xDK+tf 

GU  =  GU'xDK+i 

,  *  GL 

target  =  — 

Using  the  global  target  value,  the  local  demons  assess  the  relative 
effectiveness,  RE,  of  each  of  their  eight  force  actuators.  RE  is  calculated  as 
follows: 


RE  = 


Lifetime + K  X  target 
Usage +K 
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where  K  =  20  is  a  multiplier  weighting  global  relative  to  local  experience. 

Incorporating  K  and  the  target  into  the  assessment  of  a  local  force  actuator 
serves  to  base  the  actuator’s  value  on  two  levels  of  experience:  global  experi¬ 
ence  from  the  aircraft’s  mean  lifetime  over  K  runs;  and  local  experience  from 
the  actuator’s  Lifetime  and  Usage  totals. 

Once  the  demon  has  calculated  the  relative  value  of  each  of  its  force  ac¬ 
tuators,  it  chooses  the  actuator  with  the  highest  value  as  the  one  to  activate 
during  the  next  trial  run  (see  footnote). 

Experiment  1  Results 

Because  a  pseudo-random  number  generator  was  used  to  generate  initial 
aircraft  states  and  decide  the  local  demons’  initial  control  decisions,  Experi¬ 
ment  1  was  conducted  ten  times,  each  time  with  a  different  initial  seed  value. 
The  average  results  for  the  ten  tests  are  plotted  in  Figure  5.  The  plot  shows 
the  average  target  value  versus  simulation  run  number  measured  after  every 
50  runs.  Notice  the  direct  relationship  between  the  controller’s  flight  experi¬ 
ence  and  the  aircraft’s  mean  lifetime.  An  important  statistic  not  portrayed  is 
the  number  of  successful  flights  per  experiment.  On  the  average,  41  flights 
out  of  a  thousand  were  successful. 

The  results  of  Experiment  1  demonstrate  that  the  Boxes  method  may  be 
used  for  the  control  of  a  simplified  model  aircraft.  However,  as  evidenced  by 
its  low  success  rate,  the  controller's  effectiveness  is  limited.  Because  local  de- 

Because  of  the  way  the  experiment  is  initialized,  strict  adherence  to  this 
decision  rule  results  in  the  local  demon  choosing  its  initial  force  actuator  time 
after  time.  Therefore,  the  rule  is  followed  only  after  a  warm-up  period  during 
which  each  force  actuator  is  randomly  selected,  or  sampled,  at  least  once. 


Figure  5.  Simulation  Results  for  Experiment  1 

cision  rules  are  updated  only  after  each  aircraft  failure,  the  demons  do  not  re¬ 
ceive  feedback  as  to  the  immediate  consequences  of  their  decisions.  Further¬ 
more,  restricting  the  local  demons  to  one  force  actuator  activation  per  run 
reduces  the  controller’s  flexibility. 

The  controller’s  performance  can  improve  by  removing  these  restrictions. 
The  approach  taken  here  will  be  described  in  the  Experiment  2.  It  entails 
making  design  modifications  to  the  present  aircraft  controller.  The 
modifications  involve  the  addition  of  two  adaptive  threshold-logic  elements 
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similar  in  function  to  those  proposed  by  Barto  et.  al.  [2].  Consequently,  the 
objective  of  the  next  experiment  is  to  improve  the  controller’s  flight  perfor¬ 
mance. 


Experiment  2  Procedures 

This  experiment  differs  from  the  first  one  in  three  respects: 

1.  An  Adaptive  Critic  Element,  or  ACE,  is  incorporated  into  the  controll¬ 
er. 

2.  An  Associative  Search  Element,  or  ASE,  is  incorporated  into  the  con¬ 
troller. 

3.  Local  demons  may  activate  more  than  one  force  actuator  per  run. 

Otherwise,  the  initialization  procedures,  discretization  of  aircraft  states,  and 
local  control  rules  remain  the  same  as  those  in  Experiment  1. 

The  purpose  of  the  ACE  and  ASE  is  to  facilitate  local  learning  by  con¬ 
stant  reinforcement  feedback.  Recall  that,  in  Experiment  1,  local  demons  had 
to  wait  for  a  failure  signal  and  target  value  from  the  global  demon  before  they 
could  update  their  force  actuator  values  and  make  a  new  control  decision. 
With  the  modified  controller,  the  ASE  updates  force  actuator  values  every 
time  the  aircraft  state  changes. 

Essentially,  the  function  of  the  ACE  is  to  compare  the  demon  box  occu¬ 
pied  by  the  current  aircraft  state  with  the  box  occupied  by  the  previous  one, 
and  report  its  findings  to  the  ASE  after  each  unit  time-step.  Demon  box 
comparisons  are  based  on  the  Lifetime  totals  for  the  force  actuators  activated 
by  the  "current"  local  demon  and  the  "previous"  one.  The  findings,  ?,  assume 
the  values  of  either  plus  or  minus  one.  If  the  currently  activated  actuator  has 
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a  Lifetime  as  good  as,  or  greater,  than  that  of  the  prevj 


if  not,  f  is  negative. 
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The  function  of  the  ASE  is  to  modify  local  demon  force  Lifetimes  in  light 
of  the  findings  supplied  by  the  ACE.  Modification  of  a  demon  Lifetime  as- 

„»Wn  i 

sumes  two  forms:  reinforcement  and  penalization.  Reinforcement  occurs 
when  f  is  positive,  while  penalization  corresponds  to  f  being  negative.  Be¬ 
cause  of  the  manner  in  which  the  ACE  calculates  ?,  good  local  demon  deci- 
-itihs  will  be  reirfl/orcejiJ,  while  poor  decisions  will  be  penalized.  Note  that  only 

..femons  whose  -f  tjpxes  have  been  entered  during  the  current  run  become 
> 

modified;  furthermore,  modification  only  applies  to  the  Lifetime  for  the 

X  • 

demon’s  currently  at  .tivated  actuator. 

After  each  unit  time‘|step,  a  local  force  Lifetime  is  modified  according  to 
the  following  equation: 

v 

Lifetime  =  Lifetime' +?X  **XeXLifetime' 

where  a  =  0.05  =  the  minimui  n  percentage  of  a  local  Lifetime  that  may  be 
reinforced/penalized,  and 

e  =  an  eligibility  trace  for  local  de  mon  modification. 

The  eligibility  trace  measures  the  influence  of  a  local  demon’s  actions  on 
reaching  the  current  aircraft  state.  Obviously,  the  actions  of  recently- entered 
demons  have  more  of  an  influence  than  those  of  distantly-entered  ones. 
Thus,  the  former  demons  will  have  a  higher  eligibility  trace  than  the  latter 
ones.  Eligibility  begins  at  100%  when  a  demon  box  is  first  entered,  and  do- 
r  creases  exponentially  in  the  following  manner: 

e  =  c'X# 

where  0  =  0.95  =  the  percentage  of  a  demon’s  influence  which  remains  after 
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each  simulation  time-step. 


Experiment  2  Results 

As  with  the  first  experiment,  Experiment  2  was  conducted  ten  times,  each 
time  with  a  different  initial  seed  value.  As  depicted  in  Figure  6,  the  aircraft’s 
mean  lifetime  was  greatly  improved  by  the  addition  of  the  ACE  and  ASE  to 
the  controller.  In  fact,  successful  flights  occurred  537  times  out  of  1000,  or 
for  53.7%  of  the  trials.  In  several  of  the  individual  simulations,  the  mean  sys¬ 
tem  lifetime  approached  the  upper  time  limit  of  1200  unit  time-steps. 

Because  the  controller’s  task  remained  the  same  from  Experiment  1  to  Ex¬ 
periment  2,  the  results  of  the  latter  experiment  may  be  attributed  to  the 
modifications  made  to  the  controller.  The  controller  can  now  make  a  different 
decision  each  time  the  aircraft  state  enters  a  local  demon  box  during  the  same 
run.  This  capability  enables  the  controller  to  recover  more  quickly  from  poor 
decisions.  Additionally,  the  controller  can  receive  immediate  feedback  con¬ 
cerning  the  consequences  of  local  demon  decisions.  This  feedback  helps  the 
controller  to  correlate  aircraft  performance  to  local  demons’  actions. 

The  results  of  Experiment  2  show  that  the  modified  controller  works  well 
at  the  task  to  which  it  was  originally  assigned.  What  happens,  though,  when 
the  controller  is  assigned  a  more  difficult  task?  In  the  next  experiment,  the 
controller’s  task  is  made  more  difficult.  The  ensuing  results  should  provide  an 
idea  of  the  relative  tolerance  of  the  controller  to  changes  in  task  difficulty. 
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Figure  6.  Simulation  Results  for  Experiment  2 


Experiment  3  Procedures 

This  experiment  studies  the  effect  on  aircraft  performance  of  starting  each 
run  from  anywhere  in  the  aircraft  state  space.  Therefore,  x  and  y  values  are 
randomly  selected  between  ±50m,  while  the  x  and  y  values  are  selected  from 
the  ±10m/s  range.  In  experiments  1  and  2,  initial  states  fell  within  only  nine 
possible  demon  boxes  corresponding  to  the  central  portion  of  the  state  space. 
In  this  experiment,  all  225  boxes  become  eligible  starting  points  for  a  trial 


run.  The  change  in  initialization  procedures  increases  problem  difficulty  by 
forcing  the  controller  to  map  control  actions  to  the  entire  aircraft  state  space. 
Other  than  this  difference,  all  operating  procedures  are  the  same  as  those 
used  in  Experiment  2. 

Experiment  3  Results 

Experiment  3  was  conducted  with  the  same  initial  seed  values  used  in  Ex¬ 
periments  1  and  2.  The  average  results  are  shown  in  Figure  7.  Notice  that 
aircraft  performance  is  reduced  by  the  addition  of  216  more  initial  states. 
Also,  the  average  success  rate  fell  to  11.7%.  These  results  show  that  the  con¬ 
troller  learns  quicker  when  its  starting  conditions  are  more  consistent.  Other¬ 
wise,  to  attain  the  same  performance  reached  in  Experiment  2,  a  longer  learn¬ 
ing  period,  i.e.  more  trial  runs,  are  required.  This  conjecture  was  not  tested. 

Comparison  of  Experimental  Results 

For  comparison  purposes,  the  average  results  of  all  three  experiments  have 
been  superimposed  onto  the  same  graph  in  Figure  8.  With  respect  to  the 
learning  curve  for  Experiment  1,  aircraft  performance  levels  out  after  the  first 
500  runs.  Consequently,  the  experience  gained  from  the  last  500  runs  is  not 
utilized.  The  primary  reason  for  this  inefficiency  concerns  the  (long)  time  in¬ 
tervals  between  modifications  to  local  demon  decision  rules.  In  Experiment  2, 
local  control  rules  are  modified  after  every  unit  time-step.  The  effects  on  air¬ 
craft  performance  are  evident  upon  inspection  of  the  experimental  results. 
However,  in  Experiment  3,  aircraft  performance  degrades.  This  result  is  a  na- 
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Figure  7.  Simulation  Results  for  Experiment  3 


tural  consequence  of  the  addition  of  216  more  starting  states. 


Summary 


A  controller  has  been  developed  for  the  adaptive  control  of  a  simplified 
model  aircraft.  Its  components  include: 

1.  A  global  demon  that  monitors  the  aircraft  state,  issues  appropriate 
messages,  and  activates  the  aircraft  force  actuators. 
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Figure  8.  Simulation  Results  for  Experiments  1-3 


2.  A  network  of  local  demons  corresponding  to  different  regions  of  the 
aircraft  state  space  that  advise  the  global  demon  of  the  appropriate  actuator 
to  activate  when  the  aircraft  state  enters  a  given  box.  The  local  demons  ta¬ 
bulate  data  relating  to  the  consequences  of  their  previous  control  decisions. 
This  data  is  used  to  make  future  control  decisions  that  are  implemented  by 
the  global  demon. 

3.  Two  adaptive  threshold-logic  elements,  the  ACE  and  ASE,  that  modify 
local  demon  control  rules  in  light  of  immediate  aircraft  feedback. 


Because  the  controller  learns  its  task  from  trial-and-error  experience, 
changes  in  the  structure  of  its  components  can  be  made  to  provide  for  the 
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control  of  a  more  specialized  flight  task.  Such  an  undertaking  is  described  in 
the  sequel. 
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Experiments  in  Malfunction  Recovery 


The  purpose  of  the  preceding  experiments  was  to  adapt  the  Boxes  system 
to  a  simple  flight  controller.  In  achieving  this  purpose,  the  experiments  pro¬ 
vide  background  for  the  experiments  that  follow.  Their  purpose  is  to  apply 
the  controller  to  a  specific  navigational  problem,  and  study  its  performance 
under  a  simulated  aircraft  malfunction.  In  the  current  context,  a  malfunction 
exists  when  the  aircraft  loses  operational  control  of  one  or  more  of  its  eight 
force  actuators.  An  important  assumption  is  that,  despite  the  malfunction, 
the  aircraft  maintains  sufficient  directional  control  to  accomplish  its  pre¬ 
defined  mission. 


Experiment  4  Problem  Description 

This  experiment  proceeds  in  two  phases.  In  Phase  One,  the  controller 
learns  to  pilot  the  aircraft  from  one  demon  box  to  another.  When  it  achieves 
proficiency  at  this  task,  Phase  One  ends  and  Phase  Two  begins.  At  this 
point,  an  aircraft  malfunction  is  simulated  by  removing  two  of  the  aircraft’s 
eight  force  actuators.  During  the  second  phase,  the  controller  learns  to  ac¬ 
complish  its  original  task  despite  the  loss  of  the  actuators. 
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At  the  beginning  of  each  run,  the  aircraft’s  position  and  velocity  vectors 
are  initialized  as  follows: 

x  :  >25m 
y  :  25m 
x  :  Om/s 
y  :  Om/s 

Using  the  above  values  for  the  aircraft  state  variables,  the  aircraft’s  initial 
configuration  is  represented  in  Figure  9. 


Figure  9.  Initial  Aircraft  State  for  Experiment  4 

From  this  initial  state,  the  controller  must  learn  to  pilot  the  aircraft  to 
the  center  box  of  the  disretized  state  space.  This  box  corresponds  to  x  and  y 
falling  within  the  0  ±10m  range,  and  x  and  y  having  values  between  0  ±2m/s. 
With  respect  to  the  left  half  of  Figure  9,  the  aircraft  must  fly  from  an  initial 
position  in  the  lower-left  region  of  its  "airspace"  to  the  center  region.  As  in 
the  preceeding  experiments,  an  aircraft  failure  occurs  when  the  aircraft 
exceeds  its  position  and  velocity  boundaries.  Thus,  a  trial  run  ends  when  the 
aircraft  reaches  either  the  goal  state  or  a  failure  state.  Trials  continue  until 
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the  aircraft  reaches  the  goal  state  00%  of  the  time.  At  this  point,  the  aircraft 
loses  operational  of  two  of  its  eight  force  actuators.  The  controller  must  then 
recover  from  this  malfunction  by  learning  to  complete  the  aircraft’s  mission 
using  only  six  actuators. 


Experiment  4  Procedures 

Recall  that  in  the  preceeding  experiments,  the  aircraft’s  mission  was  to 
prolong  flight.  Now,  its  mission  is  to  fly  from  one  demon  box  to  another.  Be¬ 
cause  the  mission  has  changed,  the  local  demons’  mutual  goal  of  maximizing 
*heir  expected  lifetimes  no  longer  applies.  Instead,  the  local  demons  must 
minimize  the  aircraft’s  expected  "distance"  to  the  goal.  To  fulfill  this  task,  lo¬ 
cal  demons  tabulate  the  following  data: 

Force  Distances:  Relative  approximations,  for  each  force  actuator,  of  the 
aircraft’s  distance  to  the  coordinate  origin. 

Force  Usages:  Sums,  for  each  force  actuator,  of  the  number  of  times  that 
the  local  demon  decided  to  activate  each  actuator  during  previous  trial  runs. 

To  increase  the  granularity  of  the  state  space,  thresholds  have  been  added 
at  ±6m/s  for  the  aircraft  state  variables  x  and  y.  The  resultant  value  ranges 
for  the  discretized  aircraft  state  space  are  shown  in  Figure  10.  Consequently, 
the  aircraft  state  space  divides  into  5X5X5X5=625  demon  boxes  instead  of 
the  previous  225. 

The  preceeding  discussion  outlined  two  necessary  design  modifications  to 
the  controller.  First,  the  information  processed  by  the  local  demons  has 
changed.  Second,  the  number  of  local  demons  has  increased.  Now,  the  se¬ 
quence  of  events  occurring  in  a  trial  run  will  be  explained. 
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Figure  10.  Range-Coded  State  Variables  for  Experiments  4  and  5 


During  each  unit  time-step,  the  following  actions  occur: 


1.  The  global  demon  signals  the  local  demon  whose  box  has  just  been  en¬ 
tered  by  the  current  aircraft  state. 

2.  If  the  box  has  never  been  entered  during  a  trial  run,  the  demon’s  eight 
force  Distances  are  initialized  with  the  Pythagorean  distance  between  the 
current  aircraft  position  and  the  coordinate  origin. 

3.  The  local  demon  decides  on  a  force  actuator  for  the  global  demon  to 
activate.  The  demon  makes  this  decision  at  random  until  each  of  the  force 
actuators  has  been  sampled  at  least  once.  Afterward,  the  demon  decides  on 
the  force  actuator  with  the  lowest  Distance:Usage  ratio.  When  the  demon  has 
made  its  decision,  it  increments  its  appropriate  Force  Usage  entry  by  one,  and 
informs  the  global  demon  of  its  decision. 

4.  The  global  demon  activates  the  appropriate  actuator,  which  causes  the 
aircraft  state  variables  to  change. 

5.  The  ACE  compares  the  current  aircraft  state  with  the  previous  one 
and  reports  its  findings,  ?,  to  the  ASE.  To  make  the  comparison,  the  ACE 
calculates  the  Pythagorean  distance  between  the  current  aircraft  position  to 
the  coordinate  origin.  If  the  current  distance  is  less  than  the  previous  one,  it 
sets  ?  to  1;  otherwise,  it  sets  r  to  -1. 

6.  The  ASE  modifies  the  appropriate  force  Distance  value  for  each  local 
demon  whose  box  has  been  entered  during  the  current  run.  It  modifies  Dis¬ 
tance  values  as  follows: 

Distance  =  Distance'+oX?  XeXDistance' 
where  a  =  0.1  and 


28 


e  =  e'X/? 

where  &  =  0.8  (see  footnote). 

After  the  ASE  modifies  the  local  force  Distance  entries,  the  simulated  time 
is  incremented  a  unit  step,  and  steps  1-6  are  repeated.  This  cycle  continues 
until  the  aircraft  reaches  either  a  success  or  a  failure  state.  Upon  success,  the 
ACE  issues  an  ?  value  of  1.  With  r,  the  ASE  modifies  eligible  force  Distance 
values  as  in  step  6  above,  except  that  it  uses  0.5  as  its  value  for  alpha.  Upon 
failure,  ?  =  1,  and  the  ASE  modifies  local  force  Distances  using  an  a  value  of 
3.  Consequently,  local  control  decisions  are  either  significantly  reinforced  (de¬ 
creased)  or  penalized  (increased)  to  reflect  the  end  result  of  the  trial  run. 
Afterward,  the  controller  re-initializes  the  aircraft  state  variables,  and  a  new 
trial  run  begins.  The  experiment  proceeds  in  50-run  increments.  When  at 
least  45  out  of  50  flights  are  successful,  Phase  One  ends. 

Phase  Two  begins  with  the  aircraft  losing  control  of  its  up-right  and 
down-left  force  actuators.  Despite  this  malfunction,  the  controller  must  re¬ 
gain  its  90%  proficiency  rate  for  the  original  aircraft  mission.  Its  control  deci¬ 
sions  for  the  six  remaining  actuators  are  influenced  by  the  local  force  Distance 
and  Usage  totals  gained  from  Phase  One. 

Because  of  the  selection  of  initial  and  goal  aircraft  states,  the  malfunction 
prevents  the  aircraft  from  flying  directly  toward  its  positional  goal.  Instead, 
it  must  combine  its  up  and  right  actuators  to  compensate  for  the  loss  of  the 
up-right  one.  Similarly,  it  must  combine  the  down  and  left  actuators  to  com¬ 
pensate  for  the  loss  of  the  down-left  one.  When  the  aircraft  again  flies  suc- 

For  a  further  explanation  on  the  meaning  of  or,  ?,  and  e,  refer  to  the 
preceding  subsection  entitled  "Experiment  2  Procedures." 
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cessful  missions  45  times  out  of  50,  Phase  Two  and  the  experiment  end. 


Experiment  4  Results 

Experiment  4  was  conducted  ten  times,  each  time  with  a  different  initial 
seed  value.  The  final  results  are  listed  in  Figure  11.  Notice  first  that,  regard¬ 
less  of  the  initial  seed  value,  each  test  achieves  the  90%  task  proficiency  rate 
in  both  phases  of  the  experiment.  This  result  demonstrates  the  controller’s 
capability  to  learn  a  navigational  task  under  both  normal  and  malfunction 
conditions.  However,  the  required  learning  time  for  each  phase  does  not  vary. 
This  result  was  not  expected. 
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Figure  11.  Simulation  Results  for  Experiment  4 


Initially,  Phase  One  was  expected  to  take  longer  to  complete  than  Phase 
Two.  Whereas  the  controller  begins  Phase  One  with  no  experience  of  the 
consequences  of  its  decisions,  it  begins  Phase  Two  with  the  Distance  and 
Usage  totals  gained  from  the  previous  phase.  Therefore,  its  initial  decisions  in 


Phase  Two  should  be  more  accurate  than  the  random  decisions  made  at  the 
beginning  of  Phase  One.  This  initial  accuracy  was  expected  to  reduce  the  re¬ 
quired  learning  time  in  Phase  Two. 

Two  factors  contributed  to  the  experimental  results.  First,  the  experi¬ 
ment  was  divided  into  50-run  increments.  At  the  end  of  each  increment,  the 
number  of  successful  missions  was  evaluated.  If  there  were  at  least  45,  the 
appropriate  phase  would  terminate.  In  these  terms,  the  average  time  required 
to  complete  each  phase  was  two.  Perhaps,  given  a  more  difficult  task  (i.e.  one 
that  takes  longer  to  complete),  the  flight  experience  gained  in  Phase  One 
would  have  been  reflected  in  a  shorter  learning  time  for  Phase  Two. 

The  second  factor  concerns  the  particular  actuators  that  malfunctioned. 
During  Phase  One,  the  controller  learned  that  the  up-right  actuator  moved 
the  aircraft  closest  to  the  goal  from  its  initial  position.  However,  this  actua¬ 
tor  was  inoperational  during  Phase  Two.  Consequently,  the  controller’s  "best 
choice"  in  the  first  phase  was  no  longer  an  alternative  in  the  second.  This 
condition  prevented  the  controller  from  effectively  utilizing  its  experience 
gained  in  Phase  One. 

To  test  the  validity  of  these  ideas,  Experiment  5  was  devised.  Its  aim  is 
to  study  the  effects  of  task  difficulty  and  actuator  malfunction  on  required 
learning  time.  Thus,  the  final  results  should  show  more  clearly  the  temporal 
relationship  between  Phases  One  and  Two. 


Experiment  6  Procedures 

In  this  experiment,  the  controller  again  learns  to  pilot  the  aircraft  to  the 
center  box  of  the  state  space.  However,  its  initial  position  and  velocity  no 
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longer  remain  constant.  At  the  beginning  of  each  trial  run,  values  for  the 
state  variables  x  and  y  are  randomly  generated  in  the  [-30,-10]  and  [10,30] 
ranges  while  x  and  y  values  are  generated  in  the  [-2,2]  range.  Consequently, 
the  initial  aircraft  state  falls  within  one  of  four  local  demon  boxes  surround¬ 
ing  the  central  region  of  the  aircraft  state  space. 

The  random  initialization  procedures  are  designed  for  two  purposes:  (1)  to 
increase  the  difficulty  of  the  controller’s  task,  and;  (2)  to  increase  the  accura¬ 
cy  of  the  controller’s  initial  Phase  Two  decisions.  To  clarify  this  last  point, 
realize  that  only  the  up-right  and  down-left  actuators  malfunction.  Thus,  in 
Phase  Two,  whenever  the  aircraft  begins  in  the  upper-left  and  lower-right  re¬ 
gions  of  its  airspace,  its  best  decision  alternatives— down-right  and  up-left, 
respectively— still  remain.  Thus,  half  of  the  controller’s  Phase  Two  decisions 
maximize  Phase  One  experience. 

Other  than  the  addition  of  random  initial  states,  the  experimental  pro¬ 
cedures  remain  the  same  as  those  employed  in  Experiment  4. 

Experiment  5  Results 

As  usual,  Experiment  5  was  conducted  ten  times.  The  results  are  shown 
in  Figure  12. 

Due  to  the  random  initialization  procedures,  these  results  vary  more  than 
those  of  Experiment  4.  In  Phase  One,  the  number  of  runs  required  for  com¬ 
pletion  ranges  from  250  to  750.  In  Phase  Two,  only  150  to  350  runs  are  need¬ 
ed.  These  results  show  that  the  controller  requires  more  trials  to  complete 
Phase  One  than  to  complete  Phase  Two.  Thus,  for  the  task  under  study,  the 
controller’s  required  learning  time  depends  on  its  prior  task  experience. 
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Figure  12.  Simulation  Results  for  Experiment  5 


Summary 


To  accommodate  the  aircraft’s  navigational  mission,  slight  modifications 
were  made  in  the  controller’s  original  design.  The  number  of  local  demons 
was  increased  from  225  to  625.  In  addition,  the  demons’  goal  of  maximizing 
force  lifetimes  was  changed  to  minimizing  force  distances.  Finally,  a  special¬ 
ized  reinforcement  strategy  was  added  to  conclude  each  trial  run.  Despite 
these  changes,  the  current  controller  still  possesses  the  basic  components  that 
comprised  the  original  design.  Thus,  the  controller  design  offers  flexibility  in 
its  application  to  simple  aircraft  tasks. 

More  important,  the  results  of  the  two  experiments  demonstrate  the 
controller’s  malfunction  recovery  capabilities.  Although  only  one  particular 
malfunction  was  studied,  the  controller’s  usefulness  extends  to  others.  Furth¬ 
ermore,  the  aircraft  malfunction  may  occur  at  any  instant  instead  of  "wait¬ 
ing"  for  the  controller  to  achieve  task  proficiency.  This  property  stems  from 


The  Controller  in  Perspective 


The  preceeding  experiments  describe  the  development  of  a  controller  to 
pilot  a  two-dimensional  model  aircraft.  Now  let  us  reflect  on  what  the  exer¬ 
cise  has  accomplished.  Most  significantly,  it  has  provided  a  general  frame¬ 
work  for  adaptive  control  that  addresses  the  issue  of  malfunction  recovery. 
Additionally,  it  demonstrates  the  controller’s  flexibility  by  applying  it  to  two 
aircraft  tasks.  Finally,  it  provides  an  idea  of  the  controller’s  tolerance  to 
different  initial  conditions. 

Though  its  effectiveness  has  only  been  studied  with  respect  to  a  simplified 
aircraft  model,  this  fact  is  of  secondary  importance  (see  footnote).  Instead, 
the  controller’s  primary  importance  derives  from  its  capability  to  recover  from 
malfunctions. 

With  these  ideas  in  mind,  let  us  examine  the  controller  from  a  general  per¬ 
spective. 


With  the  appropriate  flight  dynamics  equations,  and  control  actions 
corresponding  to  the  actual  deflections  of  an  aircraft’s  control  surfaces,  the 
controller  can  be  modified  to  pilot  more  advanced  aircraft  systems.  The 
modification  would  entail  changes  in  the  problem  space  definition  and  local 
decision  rules.  However,  the  model’s  basic  components  and  problem-solving 
strategy  would  remain  unchanged. 


Classical  Control  Systems 


Although  it  features  certain  properties  characteristic  of  a  classical  con¬ 
troller,  the  proposed  controller  is  fundamentally  different  from  a  classical  one. 
As  with  a  classical  controller,  the  proposed  controller  periodically  outputs  an 
actuating  signal  to  the  plant,  or  process,  that  it  controls.  However,  in  a  clas¬ 
sical  controller,  the  actuating  signals  are  pre-designed  to  correspond  to 
different  input  states.  In  this  sense,  the  classical  controller  "knows"  a  priori, 
the  operating  dynamics  of  the  controlled  process.  In  the  proposed  controller, 
the  operating  dynamics  of  the  aircraft  and  its  mission  are  not  known  before¬ 
hand.  Instead,  the  controller  must  learn,  by  trial  and  error  after  the  process 
begins,  the  correct  actuating  signals  to  issue  for  each  aircraft  state. 

Another  major  difference  involves  the  implementation  of  process  feedback. 
In  classical  control  systems,  feedback  takes  on  the  form  of  an  "error 
difference"  between  the  plant's  desired  and  actual  performance.  The  controll¬ 
er  uses  this  difference  to  adjust  its  output  so  that  the  error  is  reduced  in  sub¬ 
sequent  plant  execution.  In  the  proposed  control  system,  feedback  has  two 
forms,  both  of  which  differ  from  conventional  methods.  In  the  first  form, 
feedback  occurs  only  when  the  aircraft  enters  a  success  or  failure  state.  This 
feedback  signals  the  end  of  a  trial  run.  Depending  on  the  event  (success  or 
failure)  that  terminates  the  run,  the  controller  adjusts  its  local  decision  rules. 
In  the  second  form,  feedback  from  the  ACE  "predicts"  the  aircraft’s  future 
performance  based  on  a  comparison  between  the  current  and  previous  aircraft 
states.  The  ASE  uses  this  prediction  to  adjust  the  controller’s  logic  for  deci¬ 
sions  leading  to  the  current  aircraft  state.  Consequently,  this  feedback 
influences  the  process  only  when  a  previous  input  pattern  repeats  itself. 
Thus,  in  one  instance,  feedback  occurs  infrequently  and,  in  the  other,  its 


consequences  do  not  occur  immediately. 

The  primary  reason  for  the  controller’s  deviation  from  classical  control 
theory  arises  from  the  objectives  for  its  ultimate  use.  When  initially 
configured,  the  controller  can  theoretically  be  provided  with  the  exact  operat¬ 
ing  dynamics  of  its  plant.  However,  upon  the  occurrence  of  a  plant  malfunc¬ 
tion,  the  plant’s  operating  dynamics  will  change.  Consequently,  the 
controller’s  decision  logic  will  no  longer  remain  accurate.  As  such,  the  con- 
trolle(  process  will  fail  unless  the  controller  is  designed  to  anticipate  the  par¬ 
ticular  malfunction  conditions.  Unfortunately,  because  of  the  unpredictable 
nature  of  most  malfunctions,  this  capability  is  neither  feasible  nor  practical. 
In  this  respect,  a  controller  designed  in  the  classical  manner  will  not  suffice. 
Instead,  it  is  more  desirable  to  design  a  controller  capable  of  adapting  to  the 
conditions  prevalent  for  its  current  plant  configuration. 

Adaptive  Control 

Depending  on  its  context  in  this  paper,  the  term  "adaptive  control"  can 
take  on  two  potentially  confusing  meanings.  First,  it  can  describe  the  process 
by  which  the  controller  learns  to  pilot  the  aircraft  through  its  mission.  Alter¬ 
natively,  it  can  describe  the  way  the  controller  recovers  the  aircraft  from  a 
malfunction  so  that  the  aircraft  can  continue  its  mission.  Both  processes  are 
related  in  the  sense  that  the  same  control  task  must  be  accomplished  though 
the  plant  configuration  may  vary.  For  this  reason,  subsequent  references  to 
adaptive  control  will  convey  its  former,  more  common,  meaning.  As  a  final 
note,  realize  that  the  controller’s  malfunction  recovery  capabilites  derive 
directly  from  the  adaptive  control  method  that  it  employs. 


Id  general,  as  Truxal  [11]  explains, 


the  primary  interest  in  adaptive  control  lies  in  the  possibilities 
of  an  automatic  measurement  of  process  dynamics  and  of  an  au¬ 
tomatic  and  frequent  redesign  of  controller  characteristics. 

These  activities  are  present  in  the  proposed  controller.  Until  pre-established 
termination  conditions  are  met,  the  controller  continually  measures  the 
aircraft’s  position  and  velocity  vectors.  It  uses  these  measurements  to  pro¬ 
gressively  modify  its  local  decision  rules  with  respect  to  an  overall  perfor¬ 
mance  criterion.  As  a  result,  the  controller  is  able  to  adapt  to  the  aircraft’s 
operating  conditions  in  a  manner  that  enables  the  aircraft’s  performance  to 
improve. 


Learning  Systems 

Because  of  its  adaptive  nature,  the  controller’s  task  is  not  merely  one  of 
control  itself;  it  is  one  of  learning  to  control.  Thus,  to  completely  analyze 
the  controller,  one  must  consider  its  capacity  for  learning.  Learning  occurs  by 
continually  observing  and  tabulating  the  aircraft’s  performance.  From  these 
specific  observations,  the  controller  induces  general  conclusions  as  to  the 
proper  responses  for  different  classes  of  input  states.  The  learning  process  is 
then  reflected  in  the  manner  in  which  the  aircraft’s  measured  performance 
improves  with  time. 

As  a  machine  learning  paradigm,  the  controller  exemplifies  what  Carbonell 
et.  al.  [3]  call  "learning  from  observation  and  discovery."  However,  a  more 
precise  classification  comes  from  noting  the  functions  of  the  ACE  and  ASE. 
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These  adaptive  logic  elements  in  tandem  provide  what  Widrow  et.  al.  [12] 
call  "learning  with  a  critic."  In  this  process,  the  controller  learns  its  task  via 
qualitative  comparisons  resulting  from  the  application  of  an  overall  perfor¬ 
mance  criterion  to  the  outcome  of  its  decisions. 


Self-Organization 

Implicitly  related  to  the  controller’s  adaptive  control  and  learning  capabil¬ 
ities  is  a  desirable  property  known  as  "self-organization."  Because  the 
controller’s  design  assumes  no  a  priori  knowledge  of  the  aircraft’s  flight 
dynamics,  the  controller  must  learn  its  input-output  decision  logic  from  trial- 
and-error  experience.  As  it  accumulates  flight  dynamics  information,  the  con¬ 
troller  associates  correct  responses  for  each  input  state  such  that  a  map  is 
created  for  the  previously  unknown  problem  space.  Because  the  map  is  creat¬ 
ed  a  posteriori,  the  process  of  learning  to  pilot  the  aircraft  is  said  to  self- 
organize.  For  clarification  purposes,  Saridis  [8]  offers  two  definitions: 


Self-Organizing  Control  Process:  A  control  process  is  called 
"self-organizing"  if  reduction  of  the  a  priori  uncertainties  per¬ 
taining  to  the  effective  control  of  the  process  is  accomplished 
through  information  accrued  from  subsequent  observations  of 
the  accessible  inputs  and  outputs  as  the  control  process  evolves. 


Self-Organizing  Controller:  A  controller  designed  for  a  self¬ 
organizing  control  process  will  be  called  "self-organizing"  if  it  ac¬ 
complishes  on-line  reduction  of  the  a  priori  uncertainties  per¬ 
taining  to  the  effective  control  of  the  process  as  it  evolves. 


A  self-organizing  controller  is  necessary  as  long  as  the  actions  governing 
the  effective  control  of  the  given  process  are  not  provided  from  the  outset.  In 
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the  present  context,  this  restriction  arises  because  of  the  controller’s  intended 
use  for  aircraft  malfunction  recovery.  Because  of  the  unpredictable  nature  of 
malfunction  situations,  the  particular  conditions  prevalent  in  a  malfunction 
are  difficult  to  anticipate.  Therefore,  it  is  desirable  that  the  controller  learn 
the  particular  conditions  that  apply  to  a  given  situation.  As  an  added 
benefit,  the  controller  can  use  experience  gained  in  previous  situations  to  ac¬ 
celerate  its  recovery  time.  In  essence,  self-organization  renders  the  plant’s 
operating  conditions  transparent  to  the  controller. 


Malfunction  Recovery 

The  main  result  of  this  research  has  been  the  development  of  a  controller 
that  can  recover  in  the  event  of  a  plant  malfunction.  This  capability  was 
demonstrated  by  the  controller’s  performance  in  Experiments  4  and  5.  As 
mentioned  earlier,  the  controller’s  effectiveness  generalizes  to  other  malfunc¬ 
tions  providing  that  the  aircraft  maintains  enough  directional  control  to  fly 
its  mission.  For  these  reasons,  the  controller  may  be  classified  as  a  malfunc¬ 
tion  recovery  system. 

This  classification  does  not  give  the  controller  any  properties  that  have 
not  already  been  discussed.  Instead,  it  uniquely  differentiates  this  controller 
from  all  others  previously  presented  in  the  literature.  Whereas  other  controll¬ 
ers  have  been  designed  with  adaptive,  learning,  and  self-organizing  capabili¬ 
ties,  their  application  has  heretofore  been  limited  to  processes  running  under 
normal  operating  conditions.  The  present  controller  removes  this  restriction 
by  operating  effectively  even  after  a  plant  malfunction.  Because  controlled 
processes  are  rarely  immune  to  failure,  controllers  can  only  benefit  from  the 
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incorporation  of  this  capability. 


Limitations  of  the  Proposed  Controller 

A  characteristic  feature  of  self-organization  involves  the  controller  learn¬ 
ing  its  task  as  the  controlled  process  evolves.  Because  of  this  requirement,  the 
controller’s  performance  is  highly  dependent  on  the  specificity  of  its  feedback 
and  the  heuristics  used  to  induce  its  control  rules.  Similarly,  performance  will 
vary  depending  on  the  selection  of  an  appropriate  3tate  space.  In  light  of 
these  observations,  the  results  reported  here  have  not  been  optimal.  Instead, 
they  show  that  the  controller  can  yield  useful  performance  when  applied  to  a 
non- trivial  task. 

As  a  malfunction  recovery  system,  the  controller  requires  that  a  solution 
exists  for  each  malfunction  situation.  In  this  regard,  its  use  is  limited  to  con¬ 
trolled  processes  that  exhibit  "redundancy  of  control."  When  a  unit  fails,  the 
controller  withstands  the  failure  by  effecting  compensating  control  actions 
from  units  still  remaining  operational.  However,  because  of  the  redundancy 
of  control  requirement,  more  than  one  solution  may  exist  for  a  given  control 
task.  Consequently,  the  controller  may  not  always  discover  the  "best"  solu- 


Conclusion 


The  research  presented  in  this  thesis  shows  how  adaptive  logic  can  be 
used  to  control  a  continuous  process.  In  addition,  it  shows  how  a  self¬ 
organizing  controller  can  learn  its  task  on-line.  Self-organized  learning  is  use¬ 
ful  when  only  limited  information  is  available  a  priori ,  as  in  the  case  of  pro¬ 
cess  malfunctions. 

In  conclusion,  this  thesis  proposes  a  controller  with  two  significant  capa¬ 
bilities:  (1)  it  can  learn  its  task  on-line;  and  (2)  it  can  recover  control  even 
after  a  process  malfunction.  The  first  capability  is  not  new;  it  can  be  found 
in  controllers  developed  elsewhere  in  the  literature.  However,  nowhere  in  the 
literature  has  a  self-organizing  controller  been  developed  that  addresses  the 
issue  of  malfunction  recovery.  Herein  lies  the  contribution  of  this  work. 


References 


Andrew,  A.M.  "Prerequisites  of  Self-Organization,"  in  Com¬ 
puter  and  Information  Sciences,  J.T.  Tou  and  R.H.  Wilcox, 
Eds.  (London:  Cleaver-Hume  Press,  1964)  pp.381-91. 

Barto,  A.G.,  R.S  Sutton,  and  C.W.  Anderson.  "Neuron-Like 
Adaptive  Elements  That  Can  Solve  Difficult  Learning  Con¬ 
trol  Problems,"  Coins  Technical  Report  82-20  (Amherst: 
University  of  Massachusetts  Press)  pp.  1-53. 

Carbonell,  J.G.,  R.S.  Michalski,  and  T.  M.  Mitchell.  "An 
Overview  of  Machine  Learning,"  in  Machine  Learning:  An 
Artificial  Intelligence  Approach,  Michalski,  Carbonell,  and 
Mitchell,  Eds.  (Palo  Alto:  Tioga  Publishing  Co.,  1983)  pp. 
3-23. 

Donaldson,  P.E.K.  "Error  Decorrelation:  a  Technique  for 
Matching  a  Class  of  Functions,"  in  Proceedings  of  the  Third 
International  Conference  on  Medical  Electronics,  (1960)  pp. 
173-78. 

Fu,  K.S.  "Learning  Control  Systems— Review  and  Outlook," 
IEEE  Transactions  on  Automatic  Controls,  AC-15  (2), 
(1970)  pp.  210-21. 

Michie,  D.,  and  R.A.  Chambers.  "BOXES:  An  Experiment 
in  Adaptive  Control,"  in  Machine  Intelligence  2,  E.  Dale  and 
D.  Michie,  Eds.  (Edinburgh:  Oliver  and  Boyd,  1968)  pp. 
137-52. 

Michie,  D.,  and  R.A.  Chambers.  "‘BOXES’  as  a  Model  of 
Pattern-Formation,"  in  Towards  a  Theoretical  Biology;  1, 
Prolegomena,  C.H.  Waddington,  Ed.  (Edinburgh:  Edin¬ 
burgh  University  Press,  1968)  pp.  206-15. 


Saridis,  G.N.  Self- Organizing  Control  of  Stochastic  Systems 
(New  York:  Marcel  Dekker,  Inc.,  1977). 


Savant,  C.J.,  Jr.  Control  System  Design ,  2nd  Ed. 
(McGraw-Hill,  1968). 

Schaefer,  J.F.,  and  Cannon,  R.H.,  Jr.  "On  the  Control  of 
Unstable  Mechanical  Systems,"  in  Proceedings  of  the  Inter¬ 
national  Federation  on  Automatic  Control  1966,  paper  6C. 

Truxal,  J.G.  The  Concept  of  Adaptive  Control,"  in  Adap¬ 
tive  Control  Systems,  E.  Mishkin  and  L.  Braun,  Jr.,  Eds. 
(McGraw-Hill,  1961)  pp.  1-19. 

Widrow,  B.,  N.K.  Gupta,  and  S.  Maitra  "Punish/Reward: 
Learning  with  a  Critic  in  Adaptive  Threshold  Systems,"  in 
IEEE  Transactions  on  Systems,  Man,  and  Cybernetics,  Vol. 
SMC-3,  No.  5,  (Sept  1973)  pp.  455-65. 

Widrow,  B.,  and  F.W.  Smith  "Pattern-Recognizing  Control 
Systems,"  in  Computer  and  Information  Sciences,  J.T.  Tou 
and  R.H.  Wilcox,  Eds.  (London:  Cleaver-Hume  Press, 
1964)  pp.  288-317. 


END 

FILMED 

10-85 

DTIC 


